Split-\(\widehat{R}\) and effective sample size, are routinely used to monitor the convergence of iterative simulations. \(\widehat{R}\) (Gelman and Rubin, 1992; Brooks and Gelman, 1998) and split-\(\widehat{R}\) (Gelman et al., 2013) are based on the ratio of between and within-sequence marginal variances of the simulations. The effective sample size estimate (\(N_{\rm eff}\)) by Gelman et al. (2013) is based on marginal variance and autocorrelation estimates from split chains.
\(\widehat{R}\), split-\(\widehat{R}\), and \(N_{\rm eff}\) are well defined only if the marginal posteriors have finite mean and variance, and even in that case they are less stable for distributions with long tails.
To alleviate these problems, we propose
We also propose
We review split-\(\widehat{R}\) and effective sample size estimated as implemented in Stan 2.18 (Carpenter et al., 2017; Stan Development Team, 2018c).
Here are the formulas, with the presentation taken from Gelman et al. (2013). For each scalar summary of interest \(\psi\), we compute \(B\) and \(W\), the between- and within-sequence variances: \[ B = \frac{n}{m-1}\sum_{j=1}^{m}(\overline{\psi}_{.j}-\overline{\psi}_{..})^2, \;\mbox{ where }\;\;\overline{\psi}_{.j}=\frac{1}{n}\sum_{i=1}^n \psi_{ij},\;\; \;\;\overline{\psi}_{..} = \frac{1}{m}\sum_{j=1}^m\overline{\psi}_{.j}\\ W = \frac{1}{m}\sum_{j=1}^{m}s_j^2, \;\mbox{ where }\;\; s_j^2=\frac{1}{n-1} \sum_{i=1}^n (\psi_{ij}-\overline{\psi}_{.j})^2. \]
The between-sequence variance, \(B\), contains a factor of \(n\) because it is based on the variance of the within-sequence means, \(\overline{\psi}_{.j}\), each of which is an average of \(n\) values \(\psi_{ij}\).
We can estimate \(\mbox{var}(\psi|y)\), the marginal posterior variance of the estimand, by a weighted average of \(W\) and \(B\), namely \[ \widehat{\mbox{var}}^+(\psi|y)=\frac{n-1}{n}W + \frac{1}{n}B. \] This quantity overestimates the marginal posterior variance assuming the starting distribution is appropriately overdispersed, but is unbiased under stationarity (that is, if the starting distribution equals the target distribution), or in the limit \(n\rightarrow\infty\). To have overdispersed starting distribution, independent Markov chains should be initialized with diffuse starting values for the parameters.
Meanwhile, for any finite \(n\), the ``within’’ variance \(W\) should be an underestimate of \(\mbox{var}(\psi|y)\) because the individual sequences have not had time to range over all of the target distribution and, as a result, will have less variability; in the limit as \(n\rightarrow\infty\), the expectation of \(W\) approaches \(\mbox{var}(\psi|y)\).
We monitor convergence of the iterative simulation by estimating the factor by which the scale of the current distribution for \(\psi\) might be reduced if the simulations were continued in the limit \(n\rightarrow\infty\). This potential scale reduction is estimated for monitoring convergence of iterative simulation by \[ \widehat{R}= \sqrt{\frac{\widehat{\mbox{var}}^+(\psi|y)}{W}}, \] which declines to 1 as \(n\rightarrow\infty\).
We call this split-\(\widehat{R}\) because we are applying it to chains that have been split in half. Without splitting, \(\widehat{R}\) would get fooled by nonstationary chains.
If the \(n\) simulation draws within each sequence were truly independent, then the between-sequence variance \(B\) would be an unbiased estimate of the posterior variance, \(\mbox{var}(\psi|y)\), and we would have a total of \(mn\) independent simulations from the \(m\) sequences. In general, however, the simulations of \(\psi\) within each sequence will be autocorrelated, and \(B\) will be larger than \(\mbox{var}(\psi|y)\), in expectation.
A nice introductory reference for analyzing MCMC results in general and effective sample size in particular is (see also Geyer, 1992, 2011). The particular calculations used by Stan (Stan Development Team, 2018c) follow those for split-\(\hat{R}\), which involve both cross-chain (mean) and within-chain calculations (autocorrelation); they were introduced in Stan manual and explained in more detail in Gelman et al. (2013).
One way to define effective sample size for correlated simulation draws is to consider the statistical efficiency of the average of the simulations \(\bar{\psi}_{..}\), as an estimate of the posterior mean, \(\mbox{E}(\psi|y)\). This can be a reasonable baseline even though is not the only possible summary and might be inappropriate, for example, if there is particular interest in accurate representation of low-probability events in the tails of the distribution.
The effective sample size of a sequence is defined in terms of the autocorrelations within the sequence at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \] This is just the correlation between the two chains offset by \(t\) positions. Because we know \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, multiplying the two difference terms and reducing yields \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta. \]
The effective sample size of \(N\) samples generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\rm eff} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]
Effective sample size \(N_{\rm eff}\) can be larger than \(N\) in case of antithetic Markov chains, which have negative autocorrelations on odd lags. Dynamic Hamiltonian algorithm (Hoffman and Gelman, 2014) used in Stan can produce \(N_{\rm eff}>N\) for parameters which have close to Gaussian posterior with low dependency on other parameters (Stan before 2.18 computed \(N_{\rm eff}\) so that the maximum reported value was \(N\)).
In practice, the probability function in question cannot be tractably integrated and thus the autocorrelation cannot be calculated, nor the effective sample size. Instead, these quantities must be estimated from the samples themselves. The rest of this section describes a autocorrelations and split-\(\hat{R}\) based effective sample size estimator, based on multiple split chains. For simplicity, each chain \(\theta_m\) will be assumed to be of length \(N\).
Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Geyer (2011) for more detail on using FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from multiple chains \(m \in (1,\ldots,M)\) are combined with within-sample variance estimate \(W\) and multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as \[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M \hat{\rho}_{t,m}}{\widehat{\mbox{var}}^{+}}. \label{rhohat} \] If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) will overestimate variance, leading to an overestimate of autocorrelation and an underestimate effective sample size.
Because of the noise in the correlation estimates \(\hat{\rho}_t\) as \(t\) increases, typical truncated sum of \(\hat{\rho}_t\) is used. Negative autocorrelations can happen only on odd lags and by summing over pairs starting from lag 0, the paired autocorrelation is guaranteed to be positive, monotone and convex modulo estimator noise (Geyer, 1992, 2011). Stan uses Geyer’s initial monotone sequence criterion. The effective sample size estimator is defined as \[ \hat{N}_{\rm eff} = \frac{MN} {\hat{\tau}}, \] where \[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2m+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{m} \hat{P}_{t'}, \] where \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). Initial positive sequence estimators is obtained by choosing the largest \(m\) such that \(\hat{P}_{t'}>0, \quad t' = 1,\ldots,m\). The initial monotone sequence is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding ones so that the estimated sequence is monotone.
Our \(\hat{N}_{\rm eff}\) is slightly different from similar formulas in the literature in that we use also between-sequence variance in the computation which typically gives us more conservative claims (lower values of \(N_{\rm eff}\)) compared to purely between-sequence estimates, especially when mixing is poor. If the chains are not mixing at all (e.g. the posterior is multimodal and the chains are stuck in different modes), then our \(\hat{N}_{\rm eff}\) is close to the number of chains.
In Stan pre 2.18 slightly wrong initial sequence was used, which had the effect that \(\hat{N}_{\rm eff}\) estimate was capped at \(N\). As antithetic behavior \(\hat{N}_{\rm eff}>N\) is not that common or the effect is small and capping at \(N\) can be considered to be pessismistic, this had negligible effect on any inference.
As split-\(\widehat{R}\), and \(N_{\rm eff}\) are well defined only if the marginal posteriors have finite mean and variance, we next describe approaches which are well defined for all distributions and more robust for long tailed distributions.
Rank normalization is commonly used in Spearman’s rank correlation coefficient, but also to transform predictor values in epidemiology. Rank normalization replaces each draw by its rank (or fractional rank which the rank divided by the total number of observations). We use average rank for ties to conserve the number of values the same in case of discrete quantities. Rank normalization is made jointly for all draws from all chains.
Rank normalization makes the subsequent computations well defined and invariant under bijective transformations. This means that we get same results, for example, if we use unconstrained or constrained parameters. Furthermore, to improve sensitivity to scale differences between chains, we consider folded rank-normalization which uses rank distance from the median.
Split-\(\widehat{R}\) can miss if chains have different scales but have the same location. To alleviate this we propose to to compute Split-\(\widehat{R}\) also for folded rank normalized values. Folded values are obtained by absolute deviation transformation \[\begin{equation} {\rm abs}(y-{\rm median}(y)), \end{equation}\]where median is compute from all draws from all chains. Folded rank normalized values are then obtained by rank normalizing the folded values.
For simplicity, we propose to report the maximum of rank-normalized-split-\(\widehat{R}\) and folded-rank-normalize-split-\(\widehat{R}\) values.
In additional of using rank-normalized values for convergence diagnostics, we can compute also effective sample size using them. This is estimate will be well defined even if the original distribution does not have finite mean and variance, but it is not directly applicabe to estimate the Monte Carlo error of the expectation of the original value. We could transform quantiles the Monte Carlo error of the mean rank normalized value back to the original scale, but for simplicity we propose to report (bijective) transformation invariant relative efficiency values \($ R_{\rm eff}=\textit{rank-normalized-split-}N_{\rm eff} / N\), where \(N\) is the total number of draws.
Since the marginal posterior distributions might not have finite mean and variance, by default RStan (Stan Development Team, 2018a) and RStanARM (Stan Development Team, 2018b) report median and median absolute deviation (MAD) instead of mean and standard error (SE). Median and MAD are well defined also when the marginal distribution does not have finite mean and variance.
If the median of \(x\) is \(\tilde{x}\), then by definition \(p(x < \tilde{x}) = 0.5\). We can’t present this as a simple expectation, but we can estimate it by \(\frac{1}{S}\sum I_{x<x'}\). The indicator function transforms simulation draws to 0’s and 1’s, and thus the subsequent computations are bijectively invariant. Instead of estimating the efficiency of the median directly, we can estimate the efficiency of \(p(x < \hat{Q}_{0.5})\), where \(\hat{Q}_{0.5}\) is the empirical median from the combined draws. This should be good approximation, if the efficiency for \(p(x < \hat{Q}_{\alpha})\) is smoothly changing around \(\alpha = 0.5\).
Compared to relative efficiency of mean rank, this is more local as it measures only how the simulation jumps around median.
We can also compute the approximative relative efficiency for the median absolute deviation (MAD), by computing the relative efficiency for median of absolute deviations from the median of all draws. The absolute deviations from the median of all draws are same as previously defined folded samples \[ {\rm abs}(y-{\rm median}(y)). \] We see that relative efficiency of MAD is obtained by using the same approach as for median but using the folded values used also in folded-rank-normalized-split-\(\widehat{R}\).
Previously Stan has reported \(N_{\rm eff}\) only for the posterior location (previously mean), but we demonstrate that it is good to report the relative efficiency also for scale (we propose to use MAD).
We can compute the relative efficiency also for other quantiles than the median. Some tail quantiles such as the 5% quantile or the 95% quantile are often used to construct credible intervals. Estimating the relative efficiency of such quantiles thus has high practical relevance in particular as we observe the relative efficiency for tail quantiles is often lower than for mean or median. By computing relative efficiency for many quantiles we can obtain local efficiency measure demonstrated in the experiments.
We can get even more local relative efficiency estimate by considering small intervals. We propose to compute the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+\delta})\), where \(\hat{Q}_\alpha\) is the empirical \(\alpha\)-quantile, \(\delta=1/M\) is the length of the interval with some positive integer \(M\) chosen by the researcher, and \(\alpha \in (0,\delta,\ldots,1-\delta)\) changes in steps of \(\delta\). Each interval has \(N/M\) draws, and the efficiency measures autocorrelation of an indicator function which indicates when the values are inside the specific interval. This gives us a local efficiency measure which does not depend on the shape of the distribution. We demonstrate its usefulness as a diagnostic and educational tool to teach why expectations of different functionals may have different \(N_{\rm eff}\).
We note that we can plot similar local relative efficiency plots also for distributional approximations (like Laplace and ADVI (Kucukelbir et al., 2017) approximations), using Pareto smoothed importance sampling based effective sample size estimates (Vehtari, Gelman and Gabry, 2017). There will be another case study for this.
In additional of using rank-normalized values to compute split-\(\widehat{R}\), we propose to use rank plots for each chain instead of traces. These can be used instead of trace plots, which in case of long chains tend to squeeze to a mess. We illustrate that rank plots can be used as a more compact alternative to trace plots.
We intentionally separate MCMC diagnostic based on generic and stable rank-normalized relative efficiency \(\hat{n}_{\rm eff}/n\) and Monte Carlo error estimates for the desired expectations, which is applicable only when mean and variance exist. We could use Pareto \(k\) diagnostic to estimate the existance of mean and variance (Vehtari, Gelman and Gabry, 2017). There will be another case study for diagnostics to check when certain marginal has finite mean and variance needed in order to report posterior mean and MCSE.
In the code and figures the following abbreviations have been used
The proposal is to switch in Stan
Justifications for the change are:
rank normalization makes zsRhat and zsreff well defined for all distributions, bijective transformation invariant and more stable than sRhat and neff. Adding folded version fzsRhat helps to detect differences in scale.
CDF based medsreff and masreff are well defined for all distributions, bijective transformation invariant, more stable than neff, and focuses on the median and mad_sd we are reporting.
In displays we propose eventually to use Rhat to denote the new version, but to make it more clear that rank-normalized efficiency is different we would switch to use R_eff and display relative efficiency. The relative efficiency is also easier to check for low values as we don’t need to compare it to the total number of draws.
Based on the experiments, more strict convergence diagnostic and relative efficiency warning limits could be used.
We propose the following warning thresholds, although additional experiments would be useful:
Plots shown in the upcoming section have dashed lines based on these thresholds (in continuous plots, a dashed line at 1.005 is plotted instead of 1.01, as values larger than that are usually rounded in our summaries to 1.01).
We demonstrated the proposed approached with several simulation studies.
The code for alternative new split-\(\widehat{R}\) and \(N_{\rm eff}\) versions are in monitornew.R.
Load packages
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
library(tidyverse)
library(gridExtra)
source('monitornew.R')
monitor_simple <- function(sims, ...) {
out <- monitornew(sims, warmup = 0, probs = 0.5, print = FALSE, ...)
out <- as.data.frame(out)
out$par <- seq_len(nrow(out))
out
}
This part focuses on diagnostics for
To simplify, in this part independent draws are used as a proxy for very efficient MCMC and we sample draws from a standard-normal distribution. We will discuss the behavior for non-Gaussian distributions later.
conds <- expand.grid(
iters = c(250, 1000, 4000),
trend = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
trend <- conds[i, "trend"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r <- r + seq(-trend, trend, length.out = iters)
rs <- monitor_simple(r)
res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)
If we don’t split chains, Rhat misses the trends if all chains still have similar marginal distribution.
ggplot(data=res, aes(y=Rhat, x=trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Rhat without splitting chains')
Split-Rhat can detect trends, even if the marginals of the chains are similar.
ggplot(data=res, aes(y=zsRhat, x=trend)) +
geom_point() + geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is useful for detecting non-stationarity in the chains. If we use a threshold of \(1.01\), we can detect trends which have acount for 2% of the total marginal variance. If we use a threshold of \(1.1\), we can detect trends which acount for 30% of the total marginal variance.
Relative efficiency (effective sample size divided by the number of draws) is based on split Rhat and within-chain autocorrelation.
ggplot(data=res, aes(y=zsreff, x=trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Relative efficiency (zsreff)') +
scale_y_continuous(breaks = seq(0,1,by=0.25))
Result: Split-Rhat is more sensitive to trends in small sample sizes, but relative efficiency is more sensitive with larger samples sizes (as autocorrelations can be estimated more accurately).
Advice: If in doubt, run longer chains for more accurate estimates.
Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has a different mean.
conds <- expand.grid(
iters = c(250, 1000, 4000),
shift = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
shift <- conds[i, "shift"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
rs <- monitor_simple(r)
res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data=res, aes(y=zsRhat, x=shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: If we use a threshold of \(1.01\), we can detect shift with a magnitude of one third of the marginal standard deviation. If we use a threshold of \(1.1\), we can detect shift with a magnitude equal to the marginal standard deviation. The rank plot can be used to visualize where the problem is.
ggplot(data=res, aes(y=zsreff, x=shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Relative efficiency (zsreff)') +
scale_y_continuous(breaks = seq(0,1,by=0.25))
Result: The relative efficiency is not as sensitive, but a shift with a magnitude of half the marginal standard deviation will lead to very low efficiency when sample size increases.
Rank plot visualisation for a case with 4 chains and 250 draws per chain, and shift = 0.5.
iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[,1] <- r[,1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale <- function(x, nbreaks = 50) {
max <- prod(dim(x)[1:2])
mcmc_hist(
r_scale(x),
breaks = seq(0, max, by = max / nbreaks) + 0.5
)
}
mcmc_hist_r_scale(r)
Rhat is less than \(1.05\), but the rank plot clearly shows the location of the problem (first chain).
Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others but has lower marginal variance.
conds <- expand.grid(
iters = c(250, 1000, 4000),
scale = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
scale <- conds[i, "scale"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
rs <- monitor_simple(r)
res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)
ggplot(data=res, aes(y=zsRhat, x=scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is not able to detect scale differences between chains.
ggplot(data=res, aes(y=fzsRhat, x=scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Folded-split-Rhat')
Result: Folded-Split-Rhat focuses explicitely on scales and can thus detect scale differences.
Result: If we use a threshold of \(1.01\), we can detect a chain with scale less than \(3/4\) of the standard deviation of the others. If we use threshold of \(1.1\), we can detect a chain with standard deviation less than \(1/4\) of the standard deviation of the others.
ggplot(data=res, aes(y=zsreff, x=scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Relative efficiency (zsreff)') +
scale_y_continuous(breaks = seq(0,1,by=0.25))
Result: The relative efficiency does not see a problem as it focuses on location differences between chains.
ggplot(data=res, aes(y=fzsreff, x=scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
ggtitle('Folded relative efficiency (fzsreff)') +
scale_y_continuous(breaks = seq(0,1,by=0.25))
Result: The relative efficiency of the standard deviation does see the problem as it focuses on scale.
Rank plot visualisation for a case with 4 chains and 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.
iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)
Folded Rhat is \(1.06\), but the rank plot clearly shows where the problem is.
The classic Rhat is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e. infinite), the classic Rhat is not well justified. In this section, we will use the Cauchy distribution as an example of such distribution. Also in cases where mean and variance are finite, the distribution can be far from Gaussian. Especially distributions with very long tails cause instability for variance and autocorrelation estimates. To alleviate these problems we will use Split-Rhat for rank-normalized draws. Ranks are computed by using average for ties, as this is a deterministic procedure and constant chains are mapped to constant chains.
The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution
The nominal cauchy model with direct parameterization is as follows.
writeLines(readLines("cauchy_nom.stan"))
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model:
fit_nom <- stan(file='cauchy_nom.stan', seed=7878, refresh = 0)
Warning: There were 1749 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
The diagnostics warn about very large number of transitions after warmup that exceeded the maximum treedepth, which is likely due to very long tails of the Cauchy distribution. All chains have low estimated Bayesian fraction of missing information also indicating slow mixing.
MCMC trace for the first parameter looks wild with occasional large values
samp <- as.array(fit_nom)
mcmc_trace(samp[, , 1])
Let’s check Rhat and relative efficiency diagnostics.
res <- monitor_simple(samp[, , 1:50])
which_min_eff <- which(res$zsreff == min(res$zsreff))
plot_rhat <- function(res) {
p1 <- ggplot(res, aes(x=par, y=sRhat)) +
geom_point() +
ggtitle('Classic Split-Rhat') +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ylim(c(.99 ,1.26))
p2 <- ggplot(res, aes(x=par, y=zsRhat)) +
geom_point() +
ggtitle('Rank normalized split-Rhat') +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ylim(c(.99,1.26))
p3 <- ggplot(res, aes(x=par, y=fzsRhat)) +
geom_point() +
ggtitle('Folded rank split-Rhat') +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ylim(c(.99,1.26))
grid.arrange(p1, p2, p3, nrow = 1)
}
plot_rhat(res)
For one parameter, Rhats exceed the classic threshold of 1.1. Depending on the Rhat estimate, a few others also exceed the threshold 1.01. The rank normalized split-Rhat have several values over 1.01. Please note that the classic Rhat is not well defined in this example.
plot_reff <- function(res) {
ymax <- 2
ybreaks <- seq(0, ymax, by = 0.25)
ylimits <- c(0, ymax)
p1 <- ggplot(res, aes(x=par, y=reff)) +
geom_point() +
ggtitle('Classic rel. efficiency (reff)') +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
scale_y_continuous(breaks = ybreaks, limits = ylimits)
p2 <- ggplot(res, aes(x=par, y=zsreff)) +
geom_point() +
ggtitle('New rel. efficiency (zsreff)') +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
scale_y_continuous(breaks = ybreaks, limits = ylimits)
p3 <- ggplot(res, aes(x=par, y=fzsreff)) +
geom_point() +
ggtitle('Folded rel. efficiency (fzsreff)') +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
scale_y_continuous(breaks = ybreaks, limits = ylimits)
p4 <- ggplot(res, aes(x=par, y=medsreff)) +
geom_point() +
ggtitle('Rel.eff. of median (medsreff)') +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
scale_y_continuous(breaks = ybreaks, limits = ylimits)
p5 <- ggplot(res, aes(x=1:50, y=madsreff)) +
geom_point() +
ggtitle('Rel.eff. of mad (madsreff)') +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
scale_y_continuous(breaks = ybreaks, limits = ylimits)
blank <- grid::grid.rect(gp = grid::gpar(col="white"), draw = FALSE)
grid.arrange(p1, p2, p3, blank, p4, p5, nrow = 2)
}
plot_reff(res)
Both classic and rank normalized relative efficiency estimates have several near zero values, and the overall samples shouldn’t be trusted.
Result: Relative efficiency is more sensitive than (rank-normalized) split-Rhat to detect problems of slow mixing.
We also check the indicator function I and lp__ and find out that the relative efficiency for lp__ is worryingly low.
res <- monitor_simple(samp[, , 51:52])
cat('I: r_eff = ', round(res['I','zsreff'], 2), '\n')
I: r_eff = 0.14
cat('lp__: r_eff = ', round(res['lp__','zsreff'], 2), '\n')
lp__: r_eff = 0.05
Next, we examine the relative efficiency in different parts of the posterior by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is the empirical \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\) varies in steps of \(5%\). Each interval contains \(5%\) of the draws and the efficiency measures autocorrelation of an indicator function which indicates when the values are inside the specific interval. This gives us a local efficiency measure which is does not depend on the shape of the distribution.
plot_local_reff <- function(samp = NULL, par = NULL, nalpha = NULL) {
delta <- 1 / nalpha
alphas <- seq(0, 1 - delta, by = delta)
zsreffs <- rep(NA, length(alphas))
for (i in seq_along(alphas)) {
alpha <- alphas[i]
tmp <- samp[, , par]
I <- tmp >= quantile(tmp, alpha) & tmp < quantile(tmp, alpha + delta)
rs <- monitor_simple(I)
zsreffs[i] <- rs$zsreff[1]
}
df <- data.frame(
quantile = seq(0, 1, by = delta),
zsreff = c(zsreffs, zsreffs[nalpha])
)
ggplot(data=df, aes(x=quantile, y=zsreff)) +
geom_step() +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
scale_x_continuous(breaks = seq(0,1, by=0.1)) +
scale_y_continuous(
breaks = seq(0, 1, by=0.25),
limits = c(0, 1.1)
)
}
plot_local_reff(samp = samp, par = 1, nalpha = 20)
We see that MCMC is less efficient in the tails.
Alternative way to examine the relative efficiency in different parts of the posterior, is to compute relative efficiency for \(p(x < \hat{Q}_{\alpha})\), where \(\hat{Q}_\alpha\) is the empirical \(\alpha\)-quantile and \(\alpha \in (0.025,\ldots,0.975)\). This is related to the efficiency of quantile estimates. Each interval has specified proportion of draws, and the efficiency measures autocorrelation of an indicator function which indicates when the values are inside the specific interval. This gives us a local efficiency measure which is indepedent on the shape of the distribution.
plot_quantile_reff <- function(samp = NULL, par = NULL, nalpha=NULL) {
delta <- 1 / nalpha
alphas <- seq(delta, 1 - delta, by = delta)
zsreffs <- rep(NA, length(alphas))
for (i in seq_along(alphas)) {
alpha <- alphas[i]
tmp <- samp[, , par]
I <- tmp < quantile(tmp, alpha)
rs <- monitor_simple(I)
zsreffs[i] <- rs$zsreff[1]
}
df <- data.frame(
quantile = seq(delta, 1 - delta, by = delta),
zsreff = zsreffs
)
ymax <- max(1, round(max(zsreffs, na.rm = TRUE) + 0.15, 1))
ggplot(df, aes(x=quantile, y=zsreff)) +
geom_point() +
geom_hline(yintercept = c(0,1)) +
geom_hline(yintercept = 0.1, linetype = 'dashed') +
scale_x_continuous(breaks = seq(0,1, by=0.1)) +
scale_y_continuous(
breaks = seq(0, ymax, by = 0.25),
limits = c(0, ymax)
)
}
plot_quantile_reff(samp = samp, par = 1, nalpha = 40)
Rank plot visualisation of x[1], which has relative efficiency 0.14.
mcmc_hist_r_scale(samp[, , 1])
The chains clearly have different rank plots.
The rank plot visualisation of x[26], which has a relative efficiency of 0.05, also indicates clear problems with mixing.
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])
max_treedepth=20We can try to improve the performance by increasing max_treedepth.
fit_nom_td20 <- stan(
file='cauchy_nom.stan', seed=7878,
refresh = 0, control = list(max_treedepth=20)
)
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
MCMC trace for the first parameter looks wild with occasional large values.
samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])
res <- monitor_simple(samp[, , 1:50])
which_min_eff <- which(res$zsreff == min(res$zsreff))
Treedepth exceedence and Bayesian Fraction of Missing Information are HMC/NUTS specific diagnostics. Let’s see what generic Split-Rhat and relative efficiency can detect.
We check the diagnostics for all x’s.
plot_rhat(res)
All Rhats are below \(1.1\), but many are over \(1.01\). Classic Rhat has more variation than the rank normalized (and the former is not well defined). The folded rank normalized Rhat shows that there is still more variation in scale than location between different chains.
plot_reff(res)
Some classic relative efficiencies (reffs) are close to zero. If we wouldn’t realize that the variance is infinite, we might try to run longer chains, but in case of infinite variance, zero efficiency is the truth and longer chains won’t help. However other quantities can be well defined, and that’s why it is useful also to look at the rank normalized version as a generic transformation with finite mean and variance. The smallest rank-normalized r_eff are around \(0.25\), which is not that bad. There is still large variation between parameters although they all have the same distribution.
Result: Rank normalized relative efficiency is more stable than classic relative efficiency, which is not well defined for Cauchy.
We check the indicator function I and lp__. Although increasing max_treedepth improved diagnostics for x, the efficiency for I and lp__ didn’t change.
res <- monitor_simple(samp[, , 51:52])
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.13
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.06
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
plot_local_reff(samp = samp, par = 1, nalpha = 20)
Increasing max_treedepth improved the efficiency slightly in the tails.
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(x < \hat{Q}_{\alpha})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.025,\ldots,0.975)\) in steps of \(0.025\).
plot_quantile_reff(samp = samp, par = 1, nalpha = 40)
Rank plot visualisation of x[38], which has the smallest relative efficiency 0 among the x’s. The rank plot indicates clear convergence problems.
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])
The rank plot visualisation of lp__, which has relative efficiency round(monitor_simple(samp[,, "lp__"])$zsreff), doesn’t look so good either.
mcmc_hist_r_scale(samp[, , "lp__"])
max_treedepth = 20 + 8 times longer samplingLet’s try running longer chains.
fit_nom_td20l <- stan(
file='cauchy_nom.stan', seed=7878,
refresh = 0, control = list(max_treedepth=20),
warmup=1000, iter = 9000
)
Warning: There were 3 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
MCMC trace for the first parameter looks wild with occasional large values
samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])
res <- monitor_simple(samp[, , 1:50])
which_min_eff <- which(res$zsreff == min(res$zsreff))
Treedepth exceedence and Bayesian Fraction of Missing Information are HMC/NUTS specific diagnostics. Let’s see what generic Split-Rhat and relative efficiency can detect. Check the diagnostics for all x’s.
plot_rhat(res)
All Rhats are below \(1.01\). Classic Rhat has more variation than the rank normalized Rhat (but the former is not well defined).
plot_reff(res)
Most classic relative efficiencies (reffs) are close to zero. If we wouldn’t realize that the variance is infinite, we might try to run longer chains, but in case of infinite variance zero is the true value and longer chains don’t help. However other quantities can be well defined, and that’s why it can be useful also to look at the rank normalized version as a generic transformation with finite mean and variance. The rank-normalized r_eff’s start to concentrate around 0.6, which is quite good. There is still some variation between parameters.
Result: Rank normalized relative efficiency is more stable than classic relative efficiency, which is not well defined for Cauchy.
We check the indicator function I and lp__:
res <- monitor_simple(samp[, , 51:52])
cat('I: r_eff = ', round(res['I','reff'],2), '\n')
I: r_eff = 0.16
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.04
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\) in steps of \(0.05\).
plot_local_reff(samp = samp, par = 1, nalpha = 20)
Increasing chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use finer resolution for the local efficiency estimates.
plot_local_reff(samp = samp, par = 1, nalpha = 100)
The efficiency far in the tails is worryingly low. We see similar problems in the alternative local relative efficiency.
plot_quantile_reff(samp = samp, par = 1, nalpha = 100)
The rank plot visualisation of x[34], which has the smallest relative efficiency 0.04 among the x’s.
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])
Still problems at the edges.
The rank plot visualisation of lp__, which has relative efficiency 0.04, looks as follows:
mcmc_hist_r_scale(samp[, , "lp__"])
There seems to be slight problems also in mixing for energy.
Next we examine alternative parameterization and consider Cauchy as a scale mixture of Gaussian distributions. The model has two parameters and the Cauchy distributed x’s can be computed from those. In addition of improved sampling performance, the example illustrates that focus of diagnostics matter.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
vector[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a ./ sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the alternative model:
fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)
samp <- as.array(fit_alt1)
res <- monitor_simple(samp[, , 101:150])
There are no warnings, and the sampling is much faster.
plot_rhat(res)
All Rhats are below 1.01. Classic Rhat’s look also good even if the classic is not well defined for Cauchy distribution.
plot_reff(res)
Result: Rank normalized r_eff’s have less variation than classic one which is not well defined for Cauchy.
We check the indicator function I and lp__:
res <- monitor_simple(samp[, , 151:152])
cat('I: r_eff = ', round(res['I','zsreff'], 2), '\n')
I: r_eff = 0.75
cat('lp__: r_eff = ', round(res['lp__','zsreff'], 2), '\n')
lp__: r_eff = 0.33
The relative efficiency for these are also much better than with nominal parameterization.
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha + 0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
samp <- as.array(fit_alt1)
plot_local_reff(samp = samp, par = 101, nalpha = 20)
The relative efficiency is good in all parts of the posterior.
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(x < \hat{Q}_{\alpha})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.025,\ldots,0.975)\).
samp <- as.array(fit_alt1)
plot_quantile_reff(samp = samp, par = 101, nalpha = 40)
res <- monitor_simple(samp[, , 101:150])
which_min_eff <- which(res$zsreff == min(res$zsreff))
res1 <- monitor_simple(samp[, , 1:50])
res2 <- monitor_simple(samp[, , 51:100])
cat('Mean r_eff for xs = ' , round(mean(res[,'zsreff']), 2), '\n')
Mean r_eff for xs = 1.03
cat('Mean fr_eff for xs = ' , round(mean(res[,'fzsreff']), 2), '\n')
Mean fr_eff for xs = 0.52
cat('Mean r_eff for x_as = ' , round(mean(res1[,'zsreff']), 2), '\n')
Mean r_eff for x_as = 1.15
cat('Mean r_eff for x_bs = ' , round(mean(res2[,'zsreff']), 2), '\n')
Mean r_eff for x_bs = 0.69
Result: We see that the relative efficiency of the interesting \(x\) can be different from the relative efficiency of the parameters that used to compute it. Also r_eff we are computing is for rank normalized and to compute Monte Carlo errors, the final target function should be taken into account, but we just want to avoid the complication of the potentially non-existing moments at this point, and come back later in other paper how to use Pareto diagnostic to check whether necessary moments exist to compute the desired expectations and their Monte Carlo errors.
Rank plot visualisation of x[9], which has the smallest relative efficiency 0.85 among x’s.
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])
Looks better than for the nominal parameterization.
Rank plot visualisation of lp__, which has a relative efficiency of -81.64, 0.22, 7.87, -81.23, 1298.18, 0.32, 1307.29, 1310.06, 1318.33, 0.33, 1, 1, 1, 1, 1, 2581.71, 0.65, 1514.76, 0.38, 2767.19, 0.69, 1.
mcmc_hist_r_scale(samp[, , "lp__"])
Looks better than for the nominal parameterization. It would be useful to have the reference plotted.
Another alternative parameterization is univariate transformation as shown in the following code (3rd alternative in Michael Betancourt’s case study).
writeLines(readLines("cauchy_alt_3.stan"))
parameters {
vector<lower=0, upper=1>[50] x_tilde;
}
transformed parameters {
vector[50] x = tan(pi() * (x_tilde - 0.5));
}
model {
// Implicit uniform prior on x_tilde
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the alternative model:
fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the nominal model.
samp <- as.array(fit_alt3)
res <- monitor_simple(samp[, , 51:100])
plot_rhat(res)
All Rhats except some folded Rhats are below 1.01. Classic Rhat’s look also good even if it is not well defined for the Cauchy distribution.
plot_reff(res)
Result: Rank normalized relative efficiencies have less variation than classic ones which is not well defined for Cauchy. Rank normalized relative efficiencies are slightly larger than 1, which is possible for antithetic Markov chains which have negative correlation for odd lags.
We check the indicator function I and lp__:
res <- monitor_simple(samp[, , 101:102])
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.43
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.36
The relative efficiency for these are also much better than with the nominal parameterization.
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \leq x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
samp <- as.array(fit_alt3)
plot_local_reff(samp = samp, par = 51, nalpha = 20)
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(x < \hat{Q}_{\alpha})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.025,\ldots,0.975)\).
samp <- as.array(fit_alt3)
plot_quantile_reff(samp = samp, par = 51, nalpha = 40)
The relative efficiency in tails is worse than for the first alternative parameterization, although it’s still better than for the nominal model.
res <- monitor_simple(samp[, , 51:100])
which_min_eff <- which(res$zsreff == min(res$zsreff))
res1 <- monitor_simple(samp[, , 1:50])
cat('Mean r_eff for xs = ' , round(mean(res[,'zsreff']),2), '\n')
Mean r_eff for xs = 1.25
cat('Mean fr_eff for xs = ' , round(mean(res[,'fzsreff']),2), '\n')
Mean fr_eff for xs = 0.41
cat('Mean r_eff for x_as = ' , round(mean(res1[,'zsreff']),2), '\n')
Mean r_eff for x_as = 1.25
Result: When the transformation is univariate and bijective, the rank normalized Rhat and r_eff are transformation invariant.
Naturally when computing the desired expectations, the final target function should be taken into account, but we just want to avoid the complication of the potentially non-existing moments at this point, and come back later in other paper how to use Pareto diagnostic to check whether necessary moments exist to compute the desired expectations and their Monte Carlo errors.
Rank plot visualisation of x[13], which has the smallest relative efficiency of 1 among x’s.
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])
Nothing special.
Rank plot visualisation of lp__, which has relative efficiency 0.33.
mcmc_hist_r_scale(samp[, , "lp__"])
Nothing special.
Half-Cauchy priors are common and, for example, in Stan usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan does the sampling automatically in the unconstrained log(x) space, which changes the geometry crucially.
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
vector<lower=0>[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the half-Cauchy with nominal parameterization (and positive constraint).
fit_half_nom <- stan(file='half_cauchy_nom.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the Cauchy nominal model.
samp <- as.array(fit_half_nom)
res <- monitor_simple(samp[, , 1:50])
plot_rhat(res)
All Rhats are below 1.01. Classic Rhat’s look also good even if the classic is not well defined for half-Cauchy distribution.
plot_reff(res)
Result: Rank normalized r_eff’s have less variation than classic one which is not well defined for Cauchy. Rank normalized r_eff’s are much larger than 1, which is possible for antithetic Markov chains which have negative correlation for odd lags.
Due to constraint <lower=0>, the sampling was made in log(x) space, and we can also check the performance in that space.
res <- monitor_simple(samp[, , 1:50])
plot_reff(res)
\(log(x)\) is quite close to Gaussian, and thus classic r_eff is also close to rank normalized r_eff which is exactly same as for \(x\) as the rank normalized version is invariant to bijective transformations.
Result: Rank normalized relative efficiency is close to classic relative efficiency for transformations which make the distribution close to Gaussian.
We check the indicator function I and lp__:
res <- monitor_simple(samp[, , 51:52])
which_min_eff <- which(res$zsreff == min(res$zsreff))
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 1.33
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.27
Result: Relative efficiency for \(p(x_1 < 1)\) is larger than 1. This is possible for antithetic Markov chains, which have negative correlation for odd lags.
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
samp <- as.array(fit_half_nom)
plot_local_reff(samp = samp, par = 1, nalpha = 20)
The relative efficiency is good overall, with only a small dip in tails.
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.025,\ldots,0.975)\).
samp <- as.array(fit_half_nom)
plot_quantile_reff(samp = samp, par = 1, nalpha = 40)
Rank plot visualisation of x[2], which has the smallest relative efficiency 1.0 among x’s.
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])
Looks like there are differences in scale of chains.
Rank plot visualisation of lp__, which has relative efficiency 0.33.
mcmc_hist_r_scale(samp[, , "lp__"])
Maybe small differences in scales, but it’s difficult to know whether small variation from uniform is relevant.
writeLines(readLines("half_cauchy_alt.stan"))
parameters {
vector<lower=0>[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a .* sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ inv_gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run half-Cauchy with alternative parameterization
fit_half_reparam <- stan(file = 'half_cauchy_alt.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is as fast for the half-Cauchy nominal model.
samp <- as.array(fit_half_reparam)
res <- monitor_simple(samp[, , 101:150])
plot_rhat(res)
All Rhats are below 1.01. Classic Rhat’s look also good even if the classic is not well defined for half-Cauchy distribution.
plot_reff(res)
Result: Rank normalized relative efficiencies have less variation than classic ones which is not well defined for Cauchy. Based on rank normalized relative efficiencies the alternative parameterization has much lower efficiency than the nominal parameterization with constraint <lower=0>.
We check the indicator function I and lp__:
res <- monitor_simple(samp[, , 151:152])
cat('I: r_eff = ', round(res['I','zsreff'],2), '\n')
I: r_eff = 0.78
cat('lp__: r_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: r_eff = 0.24
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(\hat{Q}_\alpha \le x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.05,\ldots,0.95)\).
samp <- as.array(fit_half_reparam)
plot_local_reff(samp = samp, par = 101, nalpha = 20)
We examine the relative efficiency in different parts of the posterior, by computing the relative efficiency for \(p(x < \hat{Q}_{\alpha+0.05})\), where \(\hat{Q}_\alpha\) is \(\alpha\)-quantile and \(\alpha \in (0,0.025,\ldots,0.975)\).
samp <- as.array(fit_half_reparam)
plot_quantile_reff(samp = samp, par = 1, nalpha = 40)
The relative efficiency near zero is much worse than for the half-Cauchy with nominal parameterization and constraint <lower=0>.
Rank plot visualisation of x[15], which has the smallest relative efficiency 0.30 among x’s.
mcmc_hist_r_scale(samp[, , "x[15]"])
The rank plot is different from uniform, which is expected with lower relative efficiency, but would require a reference.
Rank plot visualisation of lp__, which has relative efficiency 0.24.
mcmc_hist_r_scale(samp[, , "lp__"])
The rank plot is different from uniform, which is expected with lower relative efficiency, but would require a reference.
The models are from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
Run centered parameterization model with default MCMC options.
input_data <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
file='eight_schools_cp.stan', data=input_data,
iter=2000, chains=4, seed=483892929, refresh=0
)
Warning: There were 50 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
We do observe divergent transitions, which is sensitive diagnostic for this model and we observer also low BFMI, but let’s assume we were investigating a MCMC algorithm without such diagnostic.
sel <- c(
'sRhat', 'zsRhat', 'fzsRhat', 'reff',
'zsreff', 'fzsreff', 'medsreff', 'madsreff'
)
res <- monitor_simple(fit_cp)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
mu 1.01 1.01 1.01 0.16 0.16 0.20 0.20 0.25
tau 1.02 1.02 1.00 0.07 0.05 0.22 0.09 0.24
theta[1] 1.00 1.00 1.01 0.28 0.26 0.16 0.25 0.19
theta[2] 1.00 1.00 1.01 0.27 0.26 0.21 0.26 0.21
theta[3] 1.01 1.01 1.01 0.21 0.21 0.22 0.24 0.25
theta[4] 1.00 1.00 1.02 0.29 0.28 0.29 0.27 0.23
theta[5] 1.01 1.01 1.01 0.22 0.22 0.29 0.23 0.33
theta[6] 1.00 1.00 1.01 0.28 0.27 0.23 0.24 0.24
theta[7] 1.00 1.00 1.01 0.23 0.22 0.20 0.25 0.22
theta[8] 1.00 1.00 1.01 0.31 0.28 0.19 0.26 0.30
lp__ 1.03 1.03 1.01 0.04 0.05 0.16 0.08 0.23
Some rank-normalized split-Rhats are larger than 1.01. Relative efficiencies for tau and lp__ are less than 10%, which is worry-some and longer chains should be run.
We examine the relative efficiency of probabilities between quantiles (0, 0.05, …, 1) for tau.
samp <- as.array(fit_cp)
plot_local_reff(samp = samp, par = "tau", nalpha = 20)
We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates.
We examine the relative efficiency of cumulative probabilities below quantiles \((0.025, ..., 0.975)\) for tau.
samp <- as.array(fit_cp)
plot_quantile_reff(samp = samp, par = 2, nalpha = 40)
Rank plot visualisation of tau, which has the smallest relative efficiency 0.05 among parameters.
mcmc_hist_r_scale(samp[, , "tau"])
We observe clear problems.
Rank plot visualisation of lp__, which has relative efficiency 0.05.
mcmc_hist_r_scale(samp[, , "lp__"])
We observe clear problems.
Low r_eff can be sometimes compensated with longer chains. Let’s check 10 times longer chain.
fit_cp2 <- stan(
file='eight_schools_cp.stan', data=input_data,
iter=20000, chains=4, seed=483892929, refresh=0
)
Warning: There were 1179 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
res <- monitor_simple(fit_cp2)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
mu 1.00 1.00 1.00 0.07 0.07 0.17 0.03 0.06
tau 1.01 1.03 1.00 0.03 0.01 0.15 0.03 0.04
theta[1] 1.00 1.00 1.01 0.22 0.20 0.05 0.05 0.03
theta[2] 1.00 1.00 1.00 0.25 0.22 0.12 0.03 0.03
theta[3] 1.00 1.00 1.00 0.13 0.11 0.12 0.03 0.03
theta[4] 1.00 1.00 1.00 0.26 0.22 0.06 0.04 0.03
theta[5] 1.00 1.00 1.00 0.11 0.09 0.16 0.03 0.06
theta[6] 1.00 1.00 1.00 0.16 0.15 0.13 0.03 0.03
theta[7] 1.00 1.00 1.00 0.18 0.16 0.08 0.05 0.03
theta[8] 1.00 1.00 1.00 0.26 0.21 0.07 0.03 0.03
lp__ 1.02 1.03 1.01 0.01 0.01 0.01 0.04 0.05
Some rank-normalized split-Rhats are still larger than 1.01. Relative efficiencies for tau and lp__ are around 1%. A drop in the relative efficiency when increasing the number of iterations indicates serious problems in mixing.
We examine the relative efficiency of probabilities between quantiles \((0, 0.05, ..., 1)\) for tau.
samp <- as.array(fit_cp2)
plot_local_reff(samp = samp, par = 2, nalpha = 50)
We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates.
We examine the relative efficiency of cumulative probabilities below quantiles \((0.01, ..., 0.99)\) for tau.
samp <- as.array(fit_cp2)
plot_quantile_reff(samp = samp, par = 2, nalpha = 100)
We examine also other parameters, like mu in the following plot.
samp <- as.array(fit_cp2)
plot_local_reff(samp = samp, par = 1, nalpha = 50)
There are gaps of poor efficiency which indicates sticking of MCMC, but the sticking doesn’t occur for any specific range of values of mu.
We examine also other parameters, like mu in the following plot.
samp <- as.array(fit_cp2)
plot_quantile_reff(samp = samp, par = 1, nalpha = 100)
Rank plot visualisation of tau, which has the smallest relative efficiency 0.01 among parameters.
mcmc_hist_r_scale(samp[, , "tau"])
We observe clear problems to sample small values of tau.
Rank plot visualisation of lp__, which has relative efficiency 0.01.
mcmc_hist_r_scale(samp[, , "lp__"])
We observe clear problems sampling different energy levels.
And for further evidence let’s check 100 times longer chains than the default.
fit_cp3 <- stan(
file='eight_schools_cp.stan', data=input_data,
iter=200000, chains=4, seed=483892929, refresh=0
)
Warning: There were 16912 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
res <- monitor_simple(fit_cp3)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
mu 1.00 1.00 1.00 0.04 0.03 0.01 0.01 0.01
tau 1.00 1.01 1.00 0.01 0.00 0.02 0.01 0.01
theta[1] 1.00 1.00 1.00 0.03 0.03 0.06 0.01 0.01
theta[2] 1.00 1.00 1.00 0.05 0.04 0.01 0.01 0.01
theta[3] 1.00 1.00 1.00 0.15 0.09 0.01 0.01 0.01
theta[4] 1.00 1.00 1.00 0.08 0.06 0.01 0.01 0.01
theta[5] 1.00 1.00 1.00 0.11 0.09 0.01 0.01 0.01
theta[6] 1.00 1.00 1.00 0.13 0.09 0.00 0.01 0.01
theta[7] 1.00 1.00 1.00 0.02 0.02 0.10 0.01 0.01
theta[8] 1.00 1.00 1.00 0.08 0.05 0.01 0.01 0.01
lp__ 1.01 1.01 1.01 0.00 0.00 0.00 0.01 0.01
Some rank-normalized split-Rhats are still larger than 1.01. Relative efficiencies for tau and lp__ are around 0% and relative efficiencies for other parameters are also getting smaller. Drop in the relative efficiency when the the sample size was increased indicates serious problems in mixing.
We examine the relative efficiency of probabilities between quantiles \((0, 0.05, ..., 1)\) for tau.
samp <- as.array(fit_cp3)
plot_local_reff(samp = samp, par = 2, nalpha = 100)
We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates. It is good to note that the low efficiency for small tau values is due to too large step size, and the low efficiency for large tau values is partially due to too small step size, but even more importantly due to the random walk in energy space.
We examine the relative efficiency of cumulative probabilities below quantiles (0.01, …, 0.99) for tau.
samp <- as.array(fit_cp3)
plot_quantile_reff(samp = samp, par = 2, nalpha = 100)
Rank plot visualisation of tau, which has the smallest relative efficiency 0.00 among parameters.
mcmc_hist_r_scale(samp[, , 2])
We observe clear problems to sample small values of tau and even with 100000 draws per chain, the plots don’t get crowded as traceplots would.
Rank plot visualisation of lp__, which has relative efficiency 0.01.
mcmc_hist(r_scale(samp[,,11]), breaks = seq(0,prod(dim(samp)[1:2]),by=prod(dim(samp)[1:2])/100)+0.5)
We observe clear problems sampling different energy levels, and the low efficiency for lp__ comes just from slow random walk (autocorrelation for chains 1 and 2 goes to zero after 1000 lags, and for chains 3 and 4 after 5000 lags).
For hierarchical models, non-centered parameterization often works better
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
Run non-centered parameterization model with default options.
input_data <- read_rdump("eight_schools.data.R")
fit_ncp <- stan(
file='eight_schools_ncp.stan', data=input_data,
iter=2000, chains=4, seed=483892929, refresh=0
)
Warning: There were 3 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
We still observe some divergent transitions with the default adapt_delta and no divergences with adapt_delta=0.95. Let’s analyze both samples.
res <- monitor_simple(fit_ncp)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
mu 1 1 1 1.06 1.07 0.51 1.12 0.56
tau 1 1 1 0.66 0.57 0.70 0.83 0.77
theta_tilde[1] 1 1 1 1.27 1.29 0.49 1.25 0.53
theta_tilde[2] 1 1 1 1.33 1.35 0.54 1.23 0.58
theta_tilde[3] 1 1 1 1.33 1.34 0.47 1.28 0.58
theta_tilde[4] 1 1 1 1.29 1.32 0.50 1.26 0.63
theta_tilde[5] 1 1 1 1.25 1.25 0.40 1.11 0.53
theta_tilde[6] 1 1 1 1.28 1.30 0.47 1.27 0.60
theta_tilde[7] 1 1 1 1.13 1.15 0.56 1.22 0.59
theta_tilde[8] 1 1 1 1.48 1.50 0.52 1.41 0.58
theta[1] 1 1 1 0.95 1.02 0.60 1.05 0.69
theta[2] 1 1 1 1.07 1.11 0.59 1.14 0.67
theta[3] 1 1 1 1.05 1.13 0.63 1.19 0.63
theta[4] 1 1 1 1.12 1.16 0.61 1.12 0.66
theta[5] 1 1 1 1.15 1.19 0.55 1.25 0.63
theta[6] 1 1 1 0.99 1.05 0.65 1.11 0.69
theta[7] 1 1 1 0.97 1.07 0.62 1.12 0.76
theta[8] 1 1 1 1.12 1.18 0.64 1.29 0.73
lp__ 1 1 1 0.36 0.38 0.64 0.47 0.84
All Rhats are close to 1, and relative efficiencies are good.
We examine the relative efficiency of probabilities between quantiles \((0, 0.05, ..., 1)\) for tau.
samp <- as.array(fit_ncp)
plot_local_reff(samp = samp, par = 2, nalpha = 20)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range.
We examine the relative efficiency of probabilities below quantiles (0.025, …, 0.975) for tau.
samp <- as.array(fit_ncp)
plot_quantile_reff(samp = samp, par = 2, nalpha = 40)
Rank plot visualisation of tau, which has the smallest relative efficiency 0.57 among parameters.
mcmc_hist_r_scale(samp[, , 2])
We can see problems in sampling small values of tau.
Rank plot visualisation of lp__, which has relative efficiency 0.01.
mcmc_hist_r_scale(samp[, , 19])
We observe clear problems sampling different energy levels.
adapt_delta=0.95Next we examine the same model but with higher adapt_delta=0.95.
fit_ncp2 <- stan(
file='eight_schools_ncp.stan', data=input_data,
iter=2000, chains=4, control=list(adapt_delta = 0.95),
seed=483892929, refresh=0
)
We get zero divergences with adapt_delta = 0.95.
res <- monitor_simple(fit_ncp2)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
mu 1 1 1 1.15 1.17 0.53 1.13 0.61
tau 1 1 1 0.83 0.62 0.77 0.82 0.70
theta_tilde[1] 1 1 1 1.09 1.11 0.54 1.14 0.63
theta_tilde[2] 1 1 1 1.16 1.17 0.45 1.05 0.56
theta_tilde[3] 1 1 1 1.23 1.23 0.45 1.16 0.56
theta_tilde[4] 1 1 1 1.26 1.28 0.45 1.26 0.58
theta_tilde[5] 1 1 1 1.15 1.16 0.49 1.17 0.56
theta_tilde[6] 1 1 1 1.29 1.29 0.52 1.19 0.59
theta_tilde[7] 1 1 1 1.10 1.11 0.55 1.09 0.64
theta_tilde[8] 1 1 1 1.19 1.19 0.53 1.22 0.60
theta[1] 1 1 1 0.99 1.07 0.66 1.13 0.72
theta[2] 1 1 1 1.24 1.29 0.58 1.21 0.66
theta[3] 1 1 1 1.07 1.13 0.68 1.07 0.73
theta[4] 1 1 1 1.24 1.27 0.67 1.27 0.73
theta[5] 1 1 1 1.17 1.20 0.72 1.08 0.77
theta[6] 1 1 1 1.11 1.18 0.63 1.21 0.64
theta[7] 1 1 1 1.18 1.22 0.68 1.17 0.74
theta[8] 1 1 1 1.03 1.10 0.69 1.15 0.71
lp__ 1 1 1 0.44 0.44 0.64 0.51 0.70
All Rhats are close to 1, and relative efficiencies are good and slightly better than with the default adapt_delta.
We examine the relative efficiency of probabilities between quantiles \((0, 0.05, ..., 1)\) for tau.
samp <- as.array(fit_ncp2)
plot_local_reff(samp = samp, par = 2, nalpha = 20)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range.
Check with a finer resolution
samp <- as.array(fit_ncp2)
plot_local_reff(samp = samp, par = 2, nalpha = 50)
Also with a finer resolution the efficiency of MCMC is good for different values of tau.
We examine the relative efficiency of probabilities below quantiles \((0.025, ..., 0.975)\) for tau.
samp <- as.array(fit_ncp2)
plot_quantile_reff(samp = samp, par = 2, nalpha = 40)
Rank plot visualisation of tau, which has the smallest relative efficiency 0.62 among parameters.
mcmc_hist_r_scale(samp[, , 2])
Higher adapt_delta seems to have improved sampling of small values.
Rank plot visualisation of lp__, which has relative efficiency 0.44.
mcmc_hist_r_scale(samp[, , 19])
We observe clear problems sampling different energy levels.
If in doubt, we can run longer chains.
input_data <- read_rdump("eight_schools.data.R")
fit_ncp3 <- stan(
file='eight_schools_ncp.stan', data=input_data,
iter=20000, chains=4, control=list(adapt_delta = 0.954),
seed=483892929, refresh=0
)
res <- monitor_simple(fit_ncp3)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
mu 1 1 1 1.09 1.10 0.51 1.11 0.62
tau 1 1 1 0.87 0.65 0.78 0.86 0.75
theta_tilde[1] 1 1 1 1.21 1.21 0.54 1.15 0.63
theta_tilde[2] 1 1 1 1.22 1.23 0.49 1.19 0.56
theta_tilde[3] 1 1 1 1.20 1.20 0.48 1.17 0.55
theta_tilde[4] 1 1 1 1.23 1.23 0.50 1.23 0.58
theta_tilde[5] 1 1 1 1.25 1.26 0.53 1.18 0.62
theta_tilde[6] 1 1 1 1.23 1.23 0.48 1.21 0.56
theta_tilde[7] 1 1 1 1.15 1.16 0.53 1.12 0.62
theta_tilde[8] 1 1 1 1.31 1.31 0.47 1.22 0.56
theta[1] 1 1 1 1.07 1.12 0.66 1.10 0.71
theta[2] 1 1 1 1.19 1.21 0.65 1.15 0.70
theta[3] 1 1 1 1.04 1.08 0.63 1.11 0.70
theta[4] 1 1 1 1.13 1.15 0.64 1.14 0.68
theta[5] 1 1 1 1.21 1.21 0.65 1.14 0.72
theta[6] 1 1 1 1.15 1.19 0.64 1.13 0.70
theta[7] 1 1 1 1.12 1.12 0.68 1.08 0.74
theta[8] 1 1 1 1.07 1.11 0.60 1.08 0.69
lp__ 1 1 1 0.41 0.41 0.64 0.51 0.75
All Rhats are close to 1, and relative efficiencies are good and slightly better than with the default adapt_delta. Relative efficiency is similar as for shorter chain, which means that running longer chains is giving us higher effective sample size.
We examine the relative efficiency of probabilities between quantiles \((0, 0.05, ..., 1)\) for tau.
samp <- as.array(fit_ncp3)
plot_local_reff(samp = samp, par = "tau", nalpha = 100)
Small tau values are still more difficult to explore, but the relative efficiency is in a good range.
We examine the relative efficiency of probabilities below quantiles \((0.01, ..., 0.99)\) for tau.
samp <- as.array(fit_ncp2)
plot_quantile_reff(samp = samp, par = "tau", nalpha = 100)
Rank plot visualisation of tau, which has the smallest relative efficiency of 0.62 among parameters.
mcmc_hist_r_scale(samp[, , "tau"])
Higher adapt_delta seems to have improved sampling of small values.
Rank plot visualisation of lp__, which has a relative efficiency of 0.44.
mcmc_hist_r_scale(samp[, , "lp__"])
We observe clear problems sampling different energy levels.
We have already seen that the relative efficiency of HMC/NUTS can be higher than with independent draws. The next example illustrates interesting relative efficiency phenomena due to properties of the HMC/NUTS algorithm.
writeLines(readLines("normal.stan"))
data {
int<lower=1> J;
}
parameters {
vector[J] x;
}
model {
x ~ normal(0, 1);
}
fit_n <- stan(
file='normal.stan', data=data.frame(J=16),
iter=20000, chains=4, seed=483892929, refresh=0
)
samp <- as.array(fit_n)
res <- monitor_simple(samp)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
x[1] 1 1 1 2.53 2.55 0.39 2.01 0.45
x[2] 1 1 1 2.67 2.68 0.41 2.01 0.48
x[3] 1 1 1 2.54 2.56 0.41 2.05 0.46
x[4] 1 1 1 2.53 2.54 0.40 2.10 0.48
x[5] 1 1 1 2.49 2.50 0.41 2.00 0.47
x[6] 1 1 1 2.51 2.53 0.42 2.09 0.50
x[7] 1 1 1 2.49 2.50 0.39 2.20 0.45
x[8] 1 1 1 2.64 2.65 0.40 2.01 0.45
x[9] 1 1 1 2.55 2.56 0.41 2.04 0.49
x[10] 1 1 1 2.62 2.63 0.40 2.03 0.47
x[11] 1 1 1 2.54 2.55 0.41 1.99 0.48
x[12] 1 1 1 2.50 2.52 0.40 2.01 0.47
x[13] 1 1 1 2.59 2.60 0.41 2.08 0.49
x[14] 1 1 1 2.49 2.51 0.41 2.11 0.48
x[15] 1 1 1 2.58 2.59 0.40 2.13 0.46
x[16] 1 1 1 2.68 2.69 0.41 2.06 0.48
lp__ 1 1 1 0.36 0.35 0.53 0.42 0.59
The relative efficiency for all x is larger than 2.5. The relative efficiency for lp__ is however only 0.35. If we take a look at all the Stan examples in this notebook, we see that the relative efficiency for lp__ is always below 0.5. lp__ correlates strongly with the total energy in HMC, and that total energy is sampled using random walk proposal once per iteration and thus it’s likely that lp__ has some random walk behavior leading to autocorrelation and relative efficiency below 1. On the other hand, No-U-Turn behavior and sampling from the trajectory may make the Markov chain to be antithetic with negative odd lag correlation and relative efficiency higher than 1 for some of its parameters-
Let’s check the local relative efficiency.
plot_local_reff(samp = samp, par = 1, nalpha = 100)
The relative efficiency for probability estimate for a small interval is close to 1 with a slight drop in the tails. This is a good result, but far from the relative efficiency for mean estimate.
Let’s check the relative efficiency of cumulative probabilities.
plot_quantile_reff(samp = samp, par = 1, nalpha = 100)
The total energy of HMC should affect how far in the tails a chain in one iteration can go. Fat tails of the target have high energy, and thus only chains with high total energy can reach there. This will suggest that the random walk in total energy would cause random walk in variance of x. Let’s check the second moment of x.
samp <- as.array(fit_n, pars = "x")^2
res <- monitor_simple(samp)
round(res[, sel], 2)
sRhat zsRhat fzsRhat reff zsreff fzsreff medsreff madsreff
x[1] 1 1 1 0.36 0.39 0.47 0.45 0.60
x[2] 1 1 1 0.37 0.41 0.50 0.48 0.63
x[3] 1 1 1 0.38 0.41 0.48 0.46 0.60
x[4] 1 1 1 0.37 0.40 0.47 0.48 0.60
x[5] 1 1 1 0.37 0.41 0.47 0.47 0.59
x[6] 1 1 1 0.37 0.42 0.47 0.50 0.61
x[7] 1 1 1 0.35 0.39 0.46 0.46 0.58
x[8] 1 1 1 0.36 0.40 0.46 0.46 0.57
x[9] 1 1 1 0.38 0.41 0.48 0.48 0.60
x[10] 1 1 1 0.36 0.40 0.47 0.47 0.61
x[11] 1 1 1 0.37 0.41 0.48 0.48 0.59
x[12] 1 1 1 0.36 0.40 0.47 0.47 0.57
x[13] 1 1 1 0.36 0.41 0.48 0.49 0.62
x[14] 1 1 1 0.37 0.41 0.47 0.48 0.61
x[15] 1 1 1 0.37 0.40 0.47 0.46 0.60
x[16] 1 1 1 0.36 0.41 0.47 0.48 0.57
Mean of the relative efficiencies for \(x_j^2\) is 0.4, which is quite close to the relative efficiency for lp__. This is not that surprising as the potential energy in normal model is relative to \(\sum_{j=1}^J x_j^2\).
Let’s check the local relative efficiency
plot_local_reff(samp = samp, par = 1, nalpha = 100)
The relative efficiency is mostly a bit below 1, but for the right tail of \(x_1^2\) the relative efficiency drops. This likely due to only some iterations have high enough total energy to obtain draws from the high energy part of the tail.
Let’s check the relative efficiency of cumulative probabilities
plot_quantile_reff(samp = samp, par = 1, nalpha = 100)
We can see the correlation between lp__ and magnitude of x[1] in the following plot.
samp <- as.array(fit_n)
qplot(
as.vector(samp[, , "lp__"]),
abs(as.vector(samp[, , "x[1]"]))
) +
xlab('lp__') +
ylab('x[1]')
Low lp__ corresponds to high energy and more variation in x[1], and high lp__ corresponds to low energy and small variation in x[1]. Finally \(\sum_{j=1}^J x_j^2\) is perfectly correlated with lp__.
qplot(
as.vector(samp[, , "lp__"]),
as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
) +
xlab('lp__') +
ylab('sum(x^2)')
This shows that even if we get high relative efficiency estimates for parameters, it is important to look at the relative efficiency of lp__ as it can indicate problems of sampling in tails, which would affect the accuracy of variance and tail quantile estimates.
Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.
Carpenter, B. et al. (2017) ‘Stan: A probabilistic programming language’, Journal of Statistical Software, Articles, 76(1), pp. 1–32. doi: 10.18637/jss.v076.i01.
Gelman, A. and Rubin, D. B. (1992) ‘Inference from iterative simulation using multiple sequences’, Statistical science, 7(4), pp. 457–472.
Gelman, A. et al. (2013) Bayesian data analysis, third edition. CRC Press.
Geyer, C. J. (1992) ‘Practical Markov chain Monte Carlo’, Statistical Science, 7, pp. 473–483.
Geyer, C. J. (2011) ‘Introduction to Markov chain Monte Carlo’, in Brooks, S. et al. (eds) Handbook of markov chain monte carlo. CRC Press.
Hoffman, M. D. and Gelman, A. (2014) ‘The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo’, Journal of Machine Learning Research, 15, pp. 1593–1623. Available at: http://jmlr.org/papers/v15/hoffman14a.html.
Kucukelbir, A. et al. (2017) ‘Automatic differentiation variational inference’, The Journal of Machine Learning Research, 18, pp. 430–474. Available at: http://jmlr.org/papers/v18/16-107.html.
Stan Development Team (2018a) ‘RStan: The R interface to Stan. R package version 2.17.3’. Available at: http://mc-stan.org.
Stan Development Team (2018b) ‘RStanArm: Bayesian applied regression modeling via Stan. R package version 2.17.4’. Available at: http://mc-stan.org.
Stan Development Team (2018c) ‘The Stan core library version 2.18.0’. Available at: http://mc-stan.org.
Vehtari, A., Gelman, A. and Gabry, J. (2017) ‘Pareto smoothed importance sampling’, arXiv preprint arXiv:1507.02646. Available at: https://arxiv.org/abs/1507.02646.
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-flto -ffat-lto-objects -Wno-unused-local-typedefs
CXXFLAGS+=-std=c++11
CFLAGS+=-O3
devtools::session_info("rstan")
Session info -------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, linux-gnu
ui X11
language en_GB:en
collate en_US.UTF-8
tz Europe/Helsinki
date 2018-08-17
Packages -----------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.1)
BH 1.66.0-1 2018-02-13 CRAN (R 3.5.1)
cli 1.0.0 2017-11-05 CRAN (R 3.5.1)
colorspace 1.3-2 2016-12-14 CRAN (R 3.5.1)
crayon 1.3.4 2017-09-16 CRAN (R 3.5.1)
dichromat 2.0-0 2013-01-24 CRAN (R 3.5.1)
digest 0.6.15 2018-01-28 CRAN (R 3.5.1)
fansi 0.2.3 2018-05-06 CRAN (R 3.5.1)
ggplot2 * 3.0.0 2018-07-03 CRAN (R 3.5.1)
glue 1.3.0 2018-07-17 CRAN (R 3.5.1)
graphics * 3.5.1 2018-07-03 local
grDevices * 3.5.1 2018-07-03 local
grid 3.5.1 2018-07-03 local
gridExtra * 2.3 2017-09-09 CRAN (R 3.5.1)
gtable 0.2.0 2016-02-26 CRAN (R 3.5.1)
inline 0.3.15 2018-05-18 CRAN (R 3.5.1)
labeling 0.3 2014-08-23 CRAN (R 3.5.1)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.1)
magrittr 1.5 2014-11-22 CRAN (R 3.5.1)
MASS 7.3-50 2018-04-30 CRAN (R 3.5.0)
Matrix 1.2-14 2018-04-09 CRAN (R 3.5.1)
methods * 3.5.1 2018-07-03 local
mgcv 1.8-24 2018-06-18 CRAN (R 3.5.0)
munsell 0.5.0 2018-06-12 CRAN (R 3.5.1)
nlme 3.1-137 2018-04-07 CRAN (R 3.5.1)
pillar 1.3.0 2018-07-14 CRAN (R 3.5.1)
plyr 1.8.4 2016-06-08 CRAN (R 3.5.1)
R6 2.2.2 2017-06-17 CRAN (R 3.5.1)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.5.1)
Rcpp 0.12.18 2018-07-23 CRAN (R 3.5.1)
RcppEigen 0.3.3.4.0 2018-02-07 CRAN (R 3.5.1)
reshape2 1.4.3 2017-12-11 CRAN (R 3.5.1)
rlang 0.2.1 2018-05-30 CRAN (R 3.5.1)
rstan * 2.17.3 2018-01-20 CRAN (R 3.5.1)
scales 0.5.0 2017-08-24 CRAN (R 3.5.1)
StanHeaders * 2.17.2 2018-01-20 CRAN (R 3.5.1)
stats * 3.5.1 2018-07-03 local
stats4 3.5.1 2018-07-03 local
stringi 1.2.4 2018-07-20 CRAN (R 3.5.1)
stringr * 1.3.1 2018-05-10 CRAN (R 3.5.1)
tibble * 1.4.2 2018-01-22 CRAN (R 3.5.1)
tools 3.5.1 2018-07-03 local
utf8 1.1.4 2018-05-24 CRAN (R 3.5.1)
utils * 3.5.1 2018-07-03 local
viridisLite 0.3.0 2018-02-01 CRAN (R 3.5.1)
withr 2.1.2 2018-03-15 CRAN (R 3.5.1)